{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Copyright 2021 NVIDIA Corporation. All Rights Reserved.\n",
    "#\n",
    "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "# you may not use this file except in compliance with the License.\n",
    "# You may obtain a copy of the License at\n",
    "#\n",
    "#     http://www.apache.org/licenses/LICENSE-2.0\n",
    "#\n",
    "# Unless required by applicable law or agreed to in writing, software\n",
    "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "# See the License for the specific language governing permissions and\n",
    "# limitations under the License.\n",
    "# =============================================================================="
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png\" style=\"width: 90px; float: right;\">\n",
    "\n",
    "# MovieLens Data Enrichment\n",
    "\n",
    "In this notebook, we will enrich the MovieLens 25M dataset with poster and movie sypnopsis scrapped from IMDB. If you wish to use synthetic multi-modal data, then proceed to [05-Create-Feature-Store.ipynb](05-Create-Feature-Store.ipynb), synthetic data section.\n",
    "\n",
    "First, we will need to install some extra package for IMDB data collection."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install imdbpy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note: restart the kernel for the new package to take effect.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import IPython\n",
    "\n",
    "IPython.Application.instance().kernel.do_shutdown(True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scraping data from IMDB\n",
    "\n",
    "The IMDB API allows the collection of a rich set of multi-modal meta data from the IMDB database, including link to poster, synopsis and plots."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from imdb import IMDb\n",
    "\n",
    "# create an instance of the IMDb class\n",
    "ia = IMDb()\n",
    "\n",
    "# get a movie and print its director(s)\n",
    "the_matrix = ia.get_movie('0114709')\n",
    "for director in the_matrix['directors']:\n",
    "    print(director['name'])\n",
    "\n",
    "# show all information that are currently available for a movie\n",
    "print(sorted(the_matrix.keys()))\n",
    "\n",
    "# show all information sets that can be fetched for a movie\n",
    "print(ia.get_movie_infoset())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(the_matrix.get('plot'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "the_matrix.get('synopsis')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Collect synopsis for all movies\n",
    "\n",
    "Next, we will collect meta data, including the synopsis, for all movies in the dataset. Note that this process will take a while to complete."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import defaultdict\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "links = pd.read_csv(\"./data/ml-25m/links.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "links.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "links.imdbId.nunique()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tqdm import tqdm\n",
    "import pickle\n",
    "from multiprocessing import Process, cpu_count\n",
    "from multiprocessing.managers import BaseManager, DictProxy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "movies = list(links['imdbId'])\n",
    "movies_id = list(links['movieId'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "movies_infos = {}\n",
    "def task(movies, movies_ids, movies_infos):\n",
    "    for i, (movie, movies_id) in tqdm(enumerate(zip(movies, movies_ids)), total=len(movies)):        \n",
    "        try:\n",
    "            movie_info = ia.get_movie(movie)\n",
    "            movies_infos[movies_id] = movie_info\n",
    "        except Exception as e:\n",
    "            print(\"Movie %d download error: \"%movies_id, e)\n",
    "\n",
    "#task(movies, movies_ids, movies_infos)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will now collect the movie metadata from IMDB using parallel threads.\n",
    "\n",
    "Please note: with higher thread counts, there is a risk of being blocked by IMDB DoS software."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print ('Gathering movies information from IMDB...')\n",
    "BaseManager.register('dict', dict, DictProxy)\n",
    "manager = BaseManager()\n",
    "manager.start()\n",
    "\n",
    "movies_infos = manager.dict()\n",
    "\n",
    "num_jobs = 5\n",
    "total = len(movies)\n",
    "chunk_size = total // num_jobs + 1\n",
    "processes = []\n",
    "\n",
    "for i in range(0, total, chunk_size):\n",
    "    proc = Process(\n",
    "        target=task,\n",
    "        args=[\n",
    "            movies[i:i+chunk_size],\n",
    "            movies_id[i:i+chunk_size],\n",
    "            movies_infos\n",
    "        ]\n",
    "    )\n",
    "    processes.append(proc)\n",
    "for proc in processes:\n",
    "    proc.start()\n",
    "for proc in processes:\n",
    "    proc.join()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "movies_infos = movies_infos.copy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(movies_infos)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('movies_info.pkl', 'wb') as f:\n",
    "    pickle.dump({\"movies_infos\": movies_infos}, f, protocol=pickle.HIGHEST_PROTOCOL)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scraping movie posters\n",
    "\n",
    "The movie metadata also contains link to poster images. We next collect these posters where available.\n",
    "\n",
    "Note: this process will take some time to complete."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from multiprocessing import Process, cpu_count\n",
    "import pickle\n",
    "import subprocess\n",
    "from tqdm import tqdm\n",
    "import os\n",
    "\n",
    "with open('movies_info.pkl', 'rb') as f:\n",
    "    movies_infos = pickle.load(f)['movies_infos']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "COLLECT_LARGE_POSTER = False\n",
    "\n",
    "filelist, targetlist = [], []\n",
    "largefilelist, largetargetlist = [], []\n",
    "\n",
    "for key, movie in tqdm(movies_infos.items(), total=len(movies_infos)):\n",
    "    if 'cover url' in movie.keys():\n",
    "        target_path = './poster_small/%s.jpg'%(movie['imdbID'])\n",
    "        if os.path.exists(target_path):\n",
    "            continue\n",
    "        targetlist.append(target_path)\n",
    "        filelist.append(movie['cover url'])\n",
    "                \n",
    "    # Optionally, collect high-res poster images \n",
    "    if COLLECT_LARGE_POSTER:\n",
    "        if 'full-size cover url' in movie.keys():\n",
    "            target_path = '\"./poster_large/%s.jpg\"'%(movie['imdbID'])\n",
    "            if os.path.exists(target_path):\n",
    "                continue\n",
    "            largetargetlist.append(target_path)\n",
    "            largefilelist.append(movie['full-size cover url'])                                "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def download_task(filelist, targetlist):\n",
    "    for i, (file, target) in tqdm(enumerate(zip(filelist, targetlist)), total=len(targetlist)):        \n",
    "        cmd = 'wget \"%s\" -O %s'%(file, target)\n",
    "        stream = os.popen(cmd)\n",
    "        output = stream.read()\n",
    "        print(output, cmd)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print ('Gathering small posters...')\n",
    "!mkdir ./poster_small\n",
    "\n",
    "num_jobs = 10\n",
    "total = len(filelist)\n",
    "chunk_size = total // num_jobs + 1\n",
    "processes = []\n",
    "\n",
    "for i in range(0, total, chunk_size):\n",
    "    proc = Process(\n",
    "        target=download_task,\n",
    "        args=[\n",
    "            filelist[i:i+chunk_size],\n",
    "            targetlist[i:i+chunk_size],            \n",
    "        ]\n",
    "    )\n",
    "    processes.append(proc)\n",
    "for proc in processes:\n",
    "    proc.start()\n",
    "for proc in processes:\n",
    "    proc.join()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if COLLECT_LARGE_POSTER:\n",
    "    print ('Gathering large posters...')\n",
    "    !mkdir ./poster_large\n",
    "\n",
    "    num_jobs = 32\n",
    "    total = len(largefilelist)\n",
    "    chunk_size = total // num_jobs + 1\n",
    "    processes = []\n",
    "\n",
    "    for i in range(0, total, chunk_size):\n",
    "        proc = Process(\n",
    "            target=download_task,\n",
    "            args=[\n",
    "                largefilelist[i:i+chunk_size],\n",
    "                largetargetlist[i:i+chunk_size],            \n",
    "            ]\n",
    "        )\n",
    "        processes.append(proc)\n",
    "    for proc in processes:\n",
    "        proc.start()\n",
    "    for proc in processes:\n",
    "        proc.join()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!ls -l poster_small|wc -l"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
